Compact Indexes for Flexible Top-k Retrieval
Abstract
We engineer a self-index-based retrieval system capable of rank-safe evaluation of top-k queries. The framework generalizes the GREEDY approach of Culpepper et al. (ESA 2010) to handle multi-term queries, including queries over phrases. We propose two techniques that significantly reduce the ranking time for a wide range of popular Information Retrieval (IR) relevance measures, such as TF×IDF and BM25. First, we reorder elements in the document array according to document weight. Second, we introduce the repetition array, which generalizes Sadakane’s (JDA 2007) document frequency structure to document subsets. Combining the document array and the repetition array, we achieve attractive functionality-space trade-offs. We provide an extensive evaluation of our system on terabyte-sized IR collections.
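To make the role of the document array concrete, here is a minimal Python sketch of top-k retrieval by term frequency. It is a naive baseline and not the GREEDY approach or the reordering and repetition-array techniques described in the abstract. The names `build_index` and `top_k_by_frequency`, the `"\x00"` separator, and the toy collection are assumptions made for illustration, and the quadratic suffix sorting only suits tiny inputs.

```python
import heapq
from bisect import bisect_left, bisect_right
from collections import Counter


def build_index(docs):
    """Concatenate the documents and build a suffix array plus document array."""
    text = ""
    doc_of_pos = []  # document id of every position in the concatenated text
    for doc_id, doc in enumerate(docs):
        text += doc + "\x00"  # assumed separator that occurs in no document
        doc_of_pos.extend([doc_id] * (len(doc) + 1))
    # Naive construction by sorting whole suffixes; toy-sized inputs only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # doc_array[j] is the document containing the j-th smallest suffix.
    doc_array = [doc_of_pos[i] for i in suffix_array]
    return text, suffix_array, doc_array


def top_k_by_frequency(index, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first."""
    text, suffix_array, doc_array = index
    m = len(pattern)
    # Length-m prefixes of sorted suffixes are themselves in sorted order.
    prefixes = [text[i:i + m] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Every occurrence of the pattern is one entry of doc_array[lo:hi].
    freq = Counter(doc_array[lo:hi])
    # Ties on frequency are broken by the smaller document id.
    return heapq.nsmallest(k, freq.items(), key=lambda item: (-item[1], item[0]))


docs = ["abracadabra", "banana", "cabana"]  # hypothetical toy collection
index = build_index(docs)
print(top_k_by_frequency(index, "ana", 2))  # [(1, 2), (2, 1)]
```

The sketch counts every occurrence in the pattern's suffix-array range, so its query time grows with the number of occurrences; avoiding that full scan is the kind of cost the structures in the abstract are designed to reduce.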
Similar resources
Improved Compressed Indexes for Full-Text Document Retrieval
We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequenci...
A General Document Retrieval in Compact Space
Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an index, can be built on it to speed up queries. While initially restricted to natural language text collections, document retrieval problems arise nowadays in applications like bioi...
Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects
The conventional Internet is acquiring a geo-spatial dimension. Web documents are being geo-tagged, and geo-referenced objects such as points of interest are being associated with descriptive text documents. The resulting fusion of geo-location and documents enables a new kind of top-k query that takes into account both location proximity and text relevancy. To our knowledge, only naive techniq...
$L^p$-Conjecture on Hypergroups
In this paper, we study the $L^p$-conjecture on locally compact hypergroups and, by some technical proofs, we give some sufficient and necessary conditions for a weighted Lebesgue space $L^p(K,w)$ to be a convolution Banach algebra, where $1<p<\infty$, $K$ is a locally compact hypergroup and $w$ is a weight function on $K$. Among other things, we also show that if $K$ is a locally compact hyper...
K²-Treaps: Range Top-k Queries in Compact Space
Efficient processing of top-k queries on multidimensional grids is a common requirement in information retrieval and data mining, for example in OLAP cubes. We introduce a data structure, the K²-treap, that represents grids in compact form and supports efficient prioritized range queries. We compare the K²-treap with state-of-the-art solutions on synthetic and real-world datasets, showing that it...